Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clear all documents without dropping schema #33

Conversation

jaeyson
Copy link
Owner

@jaeyson jaeyson commented Jul 3, 2024

Implements #28

Problem

Based on the Typesense issue, it seems that the only way to delete all documents in a collection is to simply delete the collection and recreate it.

Workaround

Implement a logic, preferably a catch-all filter if no query filter_by is passed in function.

So, 2 possible ways to use the function:

  • if no custom filter was passed, it'll use a default filter (catch-all), or
  • if there's a filter, use this option.

To delete multiple documents by ID, you can use filter_by=id: [id1, id2, id3].

To delete all documents in a collection, you can use a filter that matches all documents in your collection. For eg, if you have an int32 field called popularity in your documents, you can use filter_by=popularity:>0 to delete all documents. Or if you have a bool field called in_stock in your documents, you can use filter_by=in_stock:[true,false] to delete all documents.

@jaeyson jaeyson added documentation Improvements or additions to documentation enhancement New feature or request labels Jul 3, 2024
@jaeyson jaeyson self-assigned this Jul 3, 2024
@@ -294,7 +294,8 @@ defmodule ExTypesense.Collection do
@doc """
Permanently drops a collection by collection name or module name.

**Note**: dropping a collection does not remove the referenced alias, only the indexed documents.
**Note**: dropping a collection does not remove the referenced
alias, only the indexed documents.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: Is this really correct? We drop the collection altogether right, not just the indexed documents?

Copy link
Owner Author

@jaeyson jaeyson Jul 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shijithkjayan yes everything is removed when collection is dropped except alias (you have to explicitly delete the alias).

Here are the details:

  • dropping collection will also delete all documents
  • dropping collection will not affect alias
  • removing all documents will not drop the collection

The reason why alias is not removed when dropping collection is because of this example scenario:

A new collection (including its documents) was created and it replaces the dropped collection and points it to the existing alias.

  • This is especially helpful when you need to update the schema of a collection. You'd create a new timestamped collection, re-index your data in it and then swap the alias to point to it. source
  • You want to use the Collection Alias feature to do a reindex without any downtime into a new collection and then switch the alias over to the new collection. source

see test/collection_test.exs for the updates. LMK your thoughts on this.

Copy link
Contributor

@shijithkjayan shijithkjayan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delete_by_query feature LGTM

@jaeyson jaeyson merged commit 36eef5d into main Jul 13, 2024
3 checks passed
def delete_all_documents(conn \\ Connection.new(), collection_name)

def delete_all_documents(conn, collection_name) when is_binary(collection_name) do
delete_documents_by_query(conn, collection_name, %{filter_by: "#{collection_name}_id:>=0"})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: In my case, the ID field of a document is called id and not {collection_name}_id. And I believe in most cases the collection name will not be prefixed to the id field name. So is this a conscious decision you made to prefix the collection name here?

Also, in the github comment thread you attached in the function doc, it says that id>=0 will not work as the ID is a string field. I am sorry I missed this out while reviewing the changes. What do you think can be done here instead? @jaeyson

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a query that can work is %{filter_by: "id:!=0"}. This way it doesnt matter that the id is not an integer.

Copy link
Owner Author

@jaeyson jaeyson Aug 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! I'll make changes to it.

In my case, the ID field of a document is called id and not {collection_name}_id. And I believe in most cases the collection name will not be prefixed to the id field name. So is this a conscious decision you made to prefix the collection name here?

The prefixed id is used from using search that returns an Ecto query (see: https://github.com/jaeyson/ex_typesense/blob/main/lib/ex_typesense/result_parser.ex#L1-L31)

it says that id>=0 will not work as the ID is a string field.

Yes, you're correct. Id in Typesense is a string, AFAIK the returned JSON response is like this (I'm basing on the docs and tested it):

{
  "id": "124",
  "company_name": "Stark Industries",
  "num_employees": 5215,
  "country": "USA"
}

To delete all documents in a collection, you can use a filter that matches all documents in your collection. For eg, if you have an int32 field called popularity in your documents, you can use filter_by=popularity:>0 to delete all documents

So the change here is from %{filter_by: "#{collection_name}_id:>=0"} to %{filter_by: "id:!=0"} just like you mentioned earlier.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shijithkjayan I forgot this part:

To delete all documents in a collection, you can use a filter that matches all documents in your collection. For eg, if you have an int32 field called popularity in your documents, you can use filter_by=popularity:>0 to delete all documents.

That's one of the reason why I used {collection_name}_id (type int32) instead if id (type string).

There's a thread discussion where a wildcard is not yet added as a feature:
https://threads.typesense.org/t/deleting-typesense-documents-with-specific-field-using-filter-by-/2J3912e

One problem here is if your PK in Ecto schema is a type of UUID (<-- I'll add this in my todo list)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if its UUID, %{filter_by: "id:!=0"} will still work right?

Copy link
Owner Author

@jaeyson jaeyson Aug 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

id:!=0

@shijithkjayan I'm afraid not. Encountered an error by specifically doing that (id:!=0).

see this CI test: https://github.com/jaeyson/ex_typesense/actions/runs/10260778543/job/28387381407

This error was discussed here, and another one here:

# this error only pops-up on versions below v26
`{:error, "Not equals filtering is not supported on the `id` field."}` <-- `!=` not supported

If you read this thread https://threads.typesense.org/2J3912e:

Can i delete that documents using filter_by: "user_id:*"
It's not possible to do a wildcard match on a field with filter by.

That's the reason why id (or another field) won't work at the moment when using wildcard for filter_by.

uuid

Even if its UUID

The reason why I brought up uuid is because currently I'm prefixing collection name (table name in Ecto) in _id and using it as default_sorting_field

Basing on their docs https://typesense.org/docs/26.0/api/collections.html#schema-parameters, it says in default_sorting_field:

The name of an int32 / float field that determines the order in which the search results are ranked when a sort_by clause is not provided during searching.

So I'm not sure if a default_sorting_field with type string (uuid) would work in this case.

Hope this answers your questions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was actually able to clear all documents in my collection with %{filter_by: "id:!=0"} 🤔

ExTypesense.delete_documents_by_query(collection_name, %{filter_by: "id:!=0"}) --> This is what I used and it cleared everything with no errors

Copy link
Owner Author

@jaeyson jaeyson Aug 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shijithkjayan what typesense version are you using?

This error {:error, "Not equals filtering is not supported on the id field."} is from using Typesense v0.25.2 on the test run result https://github.com/jaeyson/ex_typesense/actions/runs/10260778543/job/28387381407

The reason I don't want to continue to make on that change (id:!=0) is because it's not supported on v0.25.2.

I wanted this library to be backwards-compatible like atleast 1 or 2 versions below. I'll gladly change this based on your request once the version of Typesense bumps up, and not breaking on previous versions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense 👍 Thanks for clarifying

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation enhancement New feature or request
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Add ability to clear all documents from a collection without dropping the collection and schema
2 participants